Efficient incremental method for data mining of a database

ABSTRACT

A method for discovering association rules in an electronic database, commonly known as data mining. A database is divided into a plurality of partitions, and each partition is scanned sequentially, with the results of the previous scan taken into consideration when the current partition is scanned. Three algorithms are further developed on this basis that deal with incremental mining, mining general temporal association rules, and mining weighted association rules in a time-variant database.

BACKGROUND OF INVENTION

[0001] 1. Field of the Invention

[0002] The present invention relates to efficient techniques for the data mining of information databases.

[0003] 2. Description of Related Art

[0004] The ability to collect huge amounts of data, together with the low cost of computing power, has given rise to enhanced automatic analysis of this data, referred to as data mining. The discovery of association relationships within databases is useful in selective marketing, decision analysis, and business management. A popular application area is market basket analysis, which studies the buying behavior of customers by searching for sets of items that are frequently purchased together or in sequence. Typically, the process of data mining is user controlled through thresholds, support and confidence parameters, or other guides to the data mining process. Many of the methods for mining large databases were introduced in “Mining Association Rules between Sets of Items in Large Databases,” R. Agrawal, T. Imielinski, and A. Swami (Proc. 1993 ACM SIGMOD Intl. Conf. on Management of Data, pp. 207-216, Washington, D.C., May 1993). In that paper, it was shown that the problem of mining association rules is composed of the following two subproblems: discovering the frequent itemsets, i.e., all itemsets that have transaction support above a pre-determined minimum support s, and using the frequent itemsets to generate the association rules for the database. The overall performance of mining association rules is in fact determined by the first subproblem. After the frequent itemsets are identified, the corresponding association rules can be derived in a straightforward manner. Previous algorithms include Apriori (R. Agrawal, T. Imielinski, and A. Swami. Mining Association Rules between Sets of Items in Large Databases. Proc. of ACM SIGMOD, pages 207-216, May 1993), TreeProjection (R. Agarwal, C. Aggarwal, and V. V. V. Prasad. A Tree Projection Algorithm for Generation of Frequent Itemsets. Journal of Parallel and Distributed Computing (Special Issue on High Performance Data Mining), 2000), and FP-tree (J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U. Dayal, and M.-C. Hsu. FreeSpan: Frequent pattern-projected sequential pattern mining. Proc. of 2000 Int. Conf. on Knowledge Discovery and Data Mining, pages 355-359, August 2000).

[0005] To better understand the invention, a brief overview of typical association rules and their derivation is provided. Let I={x₁, x₂, . . . , x_(m)} be a set of items. A set X⊆I with k=|X| is called a k-itemset or simply an itemset. Let a database D be a set of transactions, where each transaction T is a set of items such that T⊆I. A transaction T is said to support X if and only if X⊆T. Conventionally, an association rule is an implication of the form X⇒Y, meaning that the presence of the set X implies the presence of another set Y, where X⊂I, Y⊂I, and X∩Y=∅. The rule X⇒Y holds in the transaction set D with confidence c if c% of transactions in D that contain X also contain Y. The rule X⇒Y has support s in the transaction set D if s% of transactions in D contain X∪Y.
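The two measures defined above can be illustrated with a minimal Python sketch. The function names and the four-transaction database are hypothetical and are not taken from the figures; exact fractions are used so that the percentages can be checked by hand.

```python
from fractions import Fraction


def support(transactions, itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return Fraction(hits, len(transactions))


def confidence(transactions, x, y):
    """Fraction of the transactions containing X that also contain Y."""
    x, y = frozenset(x), frozenset(y)
    with_x = [t for t in transactions if x <= t]
    if not with_x:
        return Fraction(0)
    return Fraction(sum(1 for t in with_x if y <= t), len(with_x))


# Hypothetical toy database D (an assumption for illustration only).
D = [frozenset(t) for t in ({"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "C"})]

# Support of the rule A => B is the support of the union {A, B}.
rule_support = support(D, {"A", "B"})
rule_confidence = confidence(D, {"A"}, {"B"})
print(rule_support, rule_confidence)
```

In this toy database the rule A⇒B is supported by two of the four transactions, and two of the three transactions that contain A also contain B.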

[0006] For a given pair of confidence and support thresholds, the problem of mining association rules is to identify all association rules that have support and confidence greater than the corresponding minimum support threshold (denoted as s) and minimum confidence threshold (denoted as min_conf). Association rule mining algorithms work in two steps: generate all frequent itemsets that satisfy s, and generate all association rules that satisfy min_conf using the frequent itemsets. This problem can be reduced to the problem of finding all frequent itemsets for the same support threshold. As mentioned, a broad variety of efficient algorithms for mining association rules have been developed in recent years, including algorithms based on the level-wise Apriori framework, TreeProjection, and FP-growth. However, these algorithms still in many cases have high processing times, leading to increased I/O and CPU costs, and cannot effectively be applied to the mining of a publication-like database, which is of increasing popularity. An FUP algorithm updates the association rules in a database when new transactions are added to the database. Algorithm FUP is based on the framework of Apriori and is designed to discover the new frequent itemsets iteratively. The idea is to store the counts of all the frequent itemsets found in a previous mining operation. Using these stored counts and examining the newly added transactions, the overall counts of the candidate itemsets are then obtained by scanning the original database. FUP₂ is an extension of this work that updates the existing association rules when transactions are added to and deleted from the database. In essence, FUP₂ is equivalent to FUP for the case of insertion and is a complementary algorithm of FUP for the case of deletion. It is shown that FUP₂ outperforms the Apriori algorithm, which, without any provision for incremental mining, has to re-run the association rule mining algorithm on the whole updated database. Another FUP-based algorithm, called FUP₂H, was also devised to utilize the hash technique for performance improvement. Furthermore, the concept of negative borders and that of UWEP, i.e., update with early pruning, are utilized to enhance the efficiency of FUP-based algorithms. However, as will be shown by our experimental results, the above-mentioned FUP-based algorithms tend to suffer from two inherent problems, namely the occurrence of a potentially huge set of candidate itemsets and the need for multiple scans of the database. First, consider the problem of a potentially huge set of candidate itemsets. Note that the FUP-based algorithms deal with the combination of two sets of candidate itemsets which are independently generated, i.e., from the original data set and the incremental data subset. Since the set of candidate itemsets includes all the possible permutations of the elements, FUP-based algorithms may suffer from a very large set of candidate itemsets, especially from candidate 2-itemsets. This problem becomes even more severe for FUP-based algorithms when the incremented portion of the incremental mining is large. More importantly, in many applications, one may encounter new itemsets in the incremented dataset. Second, consider the need for multiple scans of the database. When new products are added to the transaction database, FUP-based algorithms may need k scans of the database in the worst case to identify a new frequent k-itemset. That is, the case of k=8 means that the database has to be scanned 8 times, which is very costly, especially in terms of I/O cost. As will become clear later, the problem of a large set of candidate itemsets will hinder an effective use of the scan reduction technique by an FUP-based algorithm.

[0007] The prior algorithms have many limitations when mining a publication database as shown in FIG. 1. In essence, a publication database is a set of transactions where each transaction T is a set of items of which each item contains an individual exhibition period. The current model of association rule mining is not able to handle the publication database due to the following fundamental problems: lack of consideration of the exhibition period of each individual item, and lack of an equitable support counting basis for each item.

[0008] In considering the example transaction database in FIG. 2 we see a further limitation of the prior art. Note that db^(i,j) is the part of the transaction database formed by a continuous region from partition P_(i) to partition P_(j). Suppose we have conducted the mining for the transaction database db^(i,j). As time advances, we are given the new data of January of 2001, and are interested in conducting an incremental mining against the new data. Instead of taking all the past data into consideration, our interest is limited to mining the data in the last 12 months. As a result, the mining of the transaction database db^(i+1,j+1) is called for. Note that since the underlying transaction database has been changed as time advances, some algorithms, such as Apriori, may have to resort to the regeneration of candidate itemsets for the determination of new frequent itemsets, which is, however, very costly even if the incremental data subset is small. On the other hand, FP-tree-based mining methods are likely to suffer from serious memory overhead problems since a portion of the database is kept in main memory during their execution. While FP-tree-based methods are shown to be efficient for small databases, it is expected that such a deficiency of memory overhead will become even more severe in the presence of a large database upon which an incremental mining process is usually performed.

[0009] A time-variant database, as shown in FIG. 3, consists of values or events varying with time. Time-variant databases are popular in many applications, such as daily fluctuations of a stock market, traces of a dynamic production process, scientific experiments, medical treatments, and weather records, to name a few. The existing model of constraint-based association rule mining is not able to efficiently handle the time-variant database due to two fundamental problems, i.e., (1) lack of consideration of the exhibition period of each individual transaction; (2) lack of an intelligent support counting basis for each item. Note that the traditional mining process treats transactions in different time periods indifferently and handles them along the same procedure. However, since different transactions have different exhibition periods in a time-variant database, only considering the occurrence count of each item might not lead to interesting mining results.

[0010] Therefore, a need exists for data mining methods that address the limitations of the prior methods as described hereinabove.

SUMMARY OF THE INVENTION

[0011] These and other features, which characterize the invention, are set forth in the claims annexed hereto and forming a further part hereof. However, for a better understanding of the invention, and of the advantages and objectives attained through its use, reference should be made to the drawings, and to the accompanying descriptive matter, in which there are described exemplary embodiments of the invention.

[0012] It is one object of the invention to provide a pre-processing algorithm with cumulative filtering and scan reduction techniques to reduce I/O and CPU costs.

[0013] It is also an object of the invention to provide an algorithm with effective partitioning of a data space for efficient memory utilization.

[0014] It is a further object of the invention to provide an algorithm for efficient incremental mining for an ongoing time-variant transaction database.

[0015] It is another object of the invention to provide an algorithm for the efficient mining of a publication-like transaction database.

[0016] It is yet a further object of the invention to provide an algorithm for mining weighted association rules in a time-variant database.

[0017] A pre-processing algorithm forms the basis of this disclosure. A database is divided into a plurality of partitions. Each partition is then scanned for 2-itemset candidates. In addition, each potential candidate itemset is given two attributes: c.start, which contains the partition number of the corresponding starting partition when the itemset was added to an accumulator, and c.count, which contains the number of occurrences of the itemset since the itemset was added to the accumulator. A partial minimal support, called the filtering threshold, is then developed. Itemsets whose occurrence counts are below the filtering threshold are removed. The remaining candidate itemsets are then carried over to the next phase for processing. This pre-processing algorithm forms the basis for the following three algorithms.

[0018] To deal with the mining of general temporal association rules, an efficient first algorithm is devised. The basic idea of the first algorithm is to first partition a publication database in light of the exhibition periods of items and then progressively accumulate the occurrence count of each candidate 2-itemset based on the intrinsic partitioning characteristics. The algorithm is also designed to employ a filtering threshold in each partition to prune out early those 2-itemsets that are cumulatively infrequent.

[0019] A second algorithm is further disclosed for incremental mining of association rules. In essence, the second algorithm partitions a transaction database into several partitions and employs a filtering threshold in each partition to deal with the candidate itemset generation. In the second algorithm, the cumulative information from the prior phases is selectively carried over toward the generation of candidate itemsets in the subsequent phases. After the processing of a phase, the algorithm outputs a cumulative filter, denoted by CF, which consists of a progressive candidate set of itemsets, their occurrence counts, and the corresponding partial support required. The cumulative filter produced in each processing phase constitutes the key component to realize the incremental mining.

[0020] The third algorithm performs mining in a time-variant database. The importance of each transaction period is first reflected by a proper weight assigned by the user. Then the algorithm partitions the time-variant database in light of the weighted periods of transactions and performs weighted mining. The algorithm is designed to progressively accumulate the itemset counts based on the intrinsic partitioning characteristics and to employ a filtering threshold in each partition to prune out early those 2-itemsets that are cumulatively infrequent. With this design, the algorithm is able to efficiently produce weighted association rules for applications where different time periods are assigned different weights, leading to results of more interest.

BRIEF DESCRIPTION OF DRAWINGS

[0021] FIG. 1 shows an illustrative publication database.

[0022] FIG. 2 shows an ongoing time-variant transaction database.

[0023] FIG. 3 shows a time-variant transaction database.

[0024] FIG. 4 shows a block diagram of a data mining system.

[0025] FIG. 5 shows an illustrative transaction database and corresponding item information.

[0026] FIGS. 6a-c show frequent temporal itemsets generation for mining general temporal association rules with the first algorithm.

[0027] FIG. 7 shows a flowchart for the first algorithm.

[0028] FIG. 8 shows the second illustrative transaction database.

[0029] FIGS. 9a-b show large itemsets generation for the incremental mining with the second algorithm.

[0030] FIG. 10 shows a flowchart for the second algorithm.

[0031] FIG. 11 shows the third illustrative database.

[0032] FIGS. 12a-c show the generation of frequent itemsets using the third algorithm.

[0033] FIG. 13 shows a flowchart for the third algorithm.

DETAILED DESCRIPTION

[0034] In the following detailed description of the preferred embodiments, reference is made to the accompanying drawings which form a part hereof, and in which is shown by way of illustration specific preferred embodiments in which the invention may be practiced. The preferred embodiments are described in sufficient detail to enable those skilled in the art to practice the invention, and it is to be understood that other embodiments may be utilized and that logical changes may be made without departing from the spirit and scope of the present invention. The following detailed description is, therefore, not to be taken in a limiting sense, and the scope of the present invention is defined only by the appended claims.

[0035] The present invention relates to an algorithm for data mining. The invention is implemented in a computer system of the type illustrated in FIG. 4. The computer system 10 consists of a CPU 11, a plurality of storage disks 12, a memory buffer 15, and application software 16. The CPU 11 applies the data mining application software 16 to information retrieved from the storage disks 12, using the memory buffer 15 to store the data in the process. While the data is illustrated as originating from the storage disks 12, the data can alternatively come from other sources such as the Internet.

[0036] A pre-processing algorithm is presented that forms the basis of three later algorithms: the first algorithm to discover general temporal association rules in a publication database, the second for the incremental mining of association rules, and the third for time-constraint mining on a time-variant database. The pre-processing algorithm operates by segmenting a database into a plurality of partitions. Each partition is then scanned sequentially for the generation of candidate 2-itemsets in the first scan of the database. In addition, each potential candidate itemset c∈C₂ has two attributes: c.start, which contains the identity of the starting partition when c was added to C₂, and c.count, which contains the number of occurrences of c since c was added to C₂. A filtering threshold is then developed, and itemsets whose occurrence counts are below the filtering threshold are removed. The remaining candidate itemsets are then carried over to the next phase of processing. After generating C₂ from the first scan of database db^(1,3), we employ the scan reduction technique and use C₂ to generate C_(k) (k=2, 3, . . . , m), where C_(m) is the candidate last-itemsets. Clearly, a C₃′ generated from C₂*C₂, instead of from L₂*L₂, will have a size greater than |C₃|, where C₃ is generated from L₂*L₂. However, since the |C₂| generated by the algorithm is very close to the theoretical minimum, i.e., |L₂|, the |C₃′| is not much larger than |C₃|. Similarly, the |C_(k)′| is close to |C_(k)|. All C_(k)′ can be stored in main memory, and we can find L_(k) (k=1, 2, . . . , n) together when the second scan of the database db^(1,3) is performed. Thus only two scans of the original database db^(1,3) are required in the preprocessing step. An example of the algorithm is shown below (which forms the basis of the next three described algorithms):

[0037] db^(1,n)=The partial database of D formed by a continuous region from P₁ to P_(n)

[0038] I=itemset

[0039] s=minimum support required

[0040] n=number of partitions;

[0041] CF=cumulative filter

[0042] P=partition

[0043] C=set of progressive candidate itemsets generated by database db^(i,j)

[0044] L=determined frequent itemset

1. $|db^{1,n}| = \sum_{k=1}^{n} |P_k|$;

[0045] 2. CF=∅;

[0046] 3. begin for k=1 to n //1^(st) scan of db^(1,n)

[0047] 4. begin for each 2-itemset I∈P_(k)

[0048] 5. if (I∉CF)

[0049] 6. I.count=N_(pk)(I);

[0050] 7. I.start=k;

[0051] 8. if (I.count≧s*|P_(k)|)

[0052] 9. CF=CF∪I;

[0053] 10. if (I∈CF)

[0054] 11. I.count=I.count+N_(pk)(I);

[0055] 12. if $\left(I.count < \left\lceil s * \sum_{m=I.start}^{k} |P_m| \right\rceil\right)$

[0056] 13. CF=CF−I;

[0057] 14. end

[0058] 15. end

[0059] 16. select C₂ from I where I∈CF

[0060] 17. begin while (C_(k)≠∅)

[0061] 18. C_(k+1)=C_(k)*C_(k)

[0062] 19. k=k+1;

[0063] 20. end

[0064] 21. begin for k=1 to n //2^(nd) scan of db^(1,n)

[0065] 22. for each itemset I∈C_(k)

[0066] 23. I.count=I.count+N_(pk)(I);

[0067] 24. end

[0068] 25. for each itemset I∈C_(k)

[0069] 26. if (I.count≧┌s*|db^(1,n)|┐)

[0070] 27. L_(k)=L_(k)∪I;

[0071] 28. end
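The two scans in the listing above can be sketched in Python as follows. This is a hedged illustration rather than the disclosed implementation: it assumes that steps 5-9 (a newly seen itemset) and steps 10-13 (an itemset already in CF) are alternative branches, it re-checks every carried-over itemset in each partition even when the itemset does not occur there, and it omits the generation of C_(k) for k≧3. The function names and the twelve-transaction database are hypothetical.

```python
from fractions import Fraction
from itertools import combinations
from math import ceil


def first_scan(partitions, s):
    """First scan: build the cumulative filter CF as {2-itemset: (start, count)}."""
    sizes = [len(p) for p in partitions]
    cf = {}
    for k, part in enumerate(partitions, start=1):
        # N_pk(I): occurrences of every 2-itemset inside partition P_k.
        local = {}
        for t in part:
            for pair in combinations(sorted(t), 2):
                local[pair] = local.get(pair, 0) + 1
        # Itemsets carried over from earlier partitions: accumulate the
        # count, then apply the filtering threshold over P_start..P_k.
        for pair in list(cf):
            start, count = cf[pair]
            count += local.pop(pair, 0)
            if count < ceil(s * sum(sizes[start - 1:k])):
                del cf[pair]
            else:
                cf[pair] = (start, count)
        # Newly seen itemsets: keep those that are frequent within P_k.
        for pair, count in local.items():
            if count >= ceil(s * sizes[k - 1]):
                cf[pair] = (k, count)
    return cf


def second_scan(partitions, candidates, s):
    """Second scan: count each candidate over the whole database."""
    total = sum(len(p) for p in partitions)
    counts = {c: 0 for c in candidates}
    for part in partitions:
        for t in part:
            for c in counts:
                if set(c) <= t:
                    counts[c] += 1
    return {c for c, n in counts.items() if n >= ceil(s * total)}


# Hypothetical toy database: 3 partitions of 4 transactions each.
partitions = [
    [{"B", "D"}, {"B", "C", "D"}, {"B", "C"}, {"A", "D"}],
    [{"B", "C", "E"}, {"D", "E"}, {"C", "D", "E"}, {"A", "C"}],
    [{"B", "C", "F"}, {"B", "F"}, {"C", "E"}, {"A", "F"}],
]
s = Fraction(3, 10)  # minimum support of 30%
cf = first_scan(partitions, s)
frequent = second_scan(partitions, cf, s)
print(cf, frequent)
```

With this toy input the filter ends with three candidate 2-itemsets (BC started in P₁, CE in P₂, BF in P₃), and only BC reaches the database-wide threshold of ┌0.3*12┐=4 in the second scan.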

[0072] This pre-processing algorithm forms the basis of the followingthree algorithms.

[0073] In order to discover general temporal association rules in a publication database, the first algorithm is used. In essence, a publication database is a set of transactions where each transaction T is a set of items of which each item contains an individual exhibition period. The current model of association rule mining is not able to handle the publication database due to a fundamental problem, i.e., lack of consideration of the exhibition period of each individual item. Consider the transaction database shown in FIG. 5, where the transaction database db^(1,3) is assumed to be segmented into three partitions P₁, P₂, P₃, which correspond to the three time granularities from January 2001 to March 2001. Suppose that min_supp=30% and min_conf=75%. Each partition is scanned sequentially for the generation of candidate 2-itemsets in the first scan of the database db^(1,3). After scanning the first segment of 4 transactions, i.e., partition P₁, 2-itemsets {BD, BC, CD, AD} are sequentially generated as shown in FIG. 6a. In addition, each potential candidate itemset c∈C₂ has two attributes: (1) c.start, which contains the partition number of the corresponding starting partition when c was added to C₂, and (2) c.count, which contains the number of occurrences of c since c was added to C₂. Since there are four transactions in P₁, the partial minimal support is ┌4*0.3┐=2. Such a partial minimal support is called the filtering threshold. Itemsets whose occurrence counts are below the filtering threshold are removed. Then, as shown in FIG. 6a, only {BD, BC}, marked by “O”, remain as candidate itemsets (of type β in this phase since they are newly generated), whose information is then carried over to the next phase P₂ of processing. Similarly, after scanning partition P₂, the occurrence counts of potential candidate 2-itemsets are recorded (of type α and type β). From FIG. 6a, it is noted that since there are also 4 transactions in P₂, the filtering threshold of those itemsets carried over from the previous phase (which become type α candidate itemsets in this phase) is ┌(4+4)*0.3┐=3 and that of newly identified candidate itemsets (i.e., type β candidate itemsets) is ┌4*0.3┐=2. It can be seen that we have 3 candidate itemsets in C₂ after the processing of partition P₂, of which one is of type α and two are of type β. Finally, partition P₃ is processed by the first algorithm. The resulting candidate 2-itemsets are C₂={BC, CE, BF} as shown in FIG. 6b. Note that, though appearing in the previous phase P₂, itemset {DE} is removed from C₂ once P₃ is taken into account since its occurrence count does not meet the filtering threshold then, i.e., 2<3. However, we do have one new itemset, i.e., BF, which joins C₂ as a type β candidate itemset. Consequently, only 3 candidate 2-itemsets are generated by the first algorithm, of which two are of type α and one is of type β. After generating C₂ from the first scan of database db^(1,3), we employ the scan reduction technique [26] and use C₂ to generate C_(k) (k=2, 3, . . . , m), where C_(m) is the candidate last-itemsets. Instead of generating C₃ from L₂*L₂, a C₂ generated by the algorithm can be used to generate the candidate 3-itemsets.

[0074] In the same way, each subsequent C_(k−1)′ can be utilized to generate C_(k)′.
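The filtering thresholds quoted in the example follow directly from the partition sizes. A minimal Python sketch of the arithmetic is given below; the helper name `filtering_threshold` is hypothetical, and exact fractions are used to avoid floating-point rounding inside the ceiling.

```python
from fractions import Fraction
from math import ceil


def filtering_threshold(s, partition_sizes, start, k):
    """Partial minimal support for an itemset that entered the candidate
    set in partition `start`, evaluated after partition `k` (1-indexed)."""
    return ceil(s * sum(partition_sizes[start - 1:k]))


sizes = [4, 4]          # |P1| and |P2| as stated in the example
s = Fraction(3, 10)     # min_supp = 30%

after_p1 = filtering_threshold(s, sizes, 1, 1)   # new itemsets in P1
alpha_p2 = filtering_threshold(s, sizes, 1, 2)   # type alpha: carried over from P1
beta_p2 = filtering_threshold(s, sizes, 2, 2)    # type beta: new in P2
print(after_p1, alpha_p2, beta_p2)
```

The three values are 2, 3, and 2, matching ┌4*0.3┐, ┌(4+4)*0.3┐, and ┌4*0.3┐ in the text.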

[0075] Clearly, a C₃′ generated from C₂*C₂, instead of from L₂*L₂, will have a size greater than |C₃|, where C₃ is generated from L₂*L₂. However, since the |C₂| generated by the first algorithm is very close to the theoretical minimum, i.e., |L₂|, the |C₃′| is not much larger than |C₃|. Similarly, the |C_(k)′| is close to |C_(k)|. Since C₂={BC, CE, BF}, no candidate k-itemset is generated in this example where k≧3. Thus C_(k)′={BC, CE, BF} are termed the candidate maximal temporal itemsets (TIs), i.e., BC^(1,3), CE^(2,3), BF^(3,3), each with its maximum exhibition period.
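The claim that C₂={BC, CE, BF} yields no candidate 3-itemset can be checked with a short Python sketch of the self-join C_(k)*C_(k). The sketch assumes the usual Apriori-style join, in which a (k+1)-itemset is kept only if all of its k-subsets are in C_(k); the function name `self_join` is hypothetical.

```python
from itertools import combinations


def self_join(candidates, k):
    """C_k * C_k: merge pairs of k-itemsets into (k+1)-itemsets and keep
    only those whose k-subsets are all present in C_k."""
    cands = {frozenset(c) for c in candidates}
    joined = set()
    for a, b in combinations(cands, 2):
        union = a | b
        if len(union) == k + 1 and all(
            frozenset(sub) in cands for sub in combinations(union, k)
        ):
            joined.add(union)
    return joined


# Candidate 2-itemsets from the example in the text.
c2 = [{"B", "C"}, {"C", "E"}, {"B", "F"}]
c3 = self_join(c2, 2)

# A contrasting, hypothetical C2 in which a 3-itemset does survive.
c3_other = self_join([{"A", "B"}, {"A", "C"}, {"B", "C"}], 2)
print(c3, c3_other)
```

For the example, BC and CE would merge into BCE, but BE is not in C₂, so nothing survives; the contrasting input produces the single candidate ABC.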

[0076] Before we perform the second scan of the database db^(1,3) to generate the L_(k)s, all candidate SIs of the candidate TIs can be propagated and then added into C_(k)′. For instance, as shown in FIG. 6c, both candidate 1-itemsets B^(1,3) and C^(1,3) are derived from BC^(1,3). Moreover, since BC^(1,3), for example, is a candidate 2-itemset, its subsets, i.e., B^(1,3) and C^(1,3), should potentially be candidate itemsets. As a result, there are 9 candidate itemsets in this example (the three candidate TIs and their six candidate SIs), and B^(1,3), B^(3,3), C^(1,3), C^(2,3), E^(2,3), and F^(3,3) are frequent SIs. As shown in FIG. 6c, after all frequent TI and SI itemsets are identified, the corresponding general temporal association rules can be derived in a straightforward manner. Explicitly, the general temporal association rule (X⇒Y)^(t,n) holds if conf((X⇒Y)^(t,n))>min_conf.
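The propagation of candidate SIs from candidate TIs can be sketched in Python as below. The representation of a temporal itemset as an `(items, (t, n))` pair and the function name `temporal_subitemsets` are assumptions made for illustration.

```python
from itertools import combinations


def temporal_subitemsets(ti):
    """All proper, non-empty sub-itemsets of a TI, each inheriting the
    exhibition period (t, n) of the TI it is derived from."""
    items, period = ti
    return {
        (frozenset(sub), period)
        for j in range(1, len(items))
        for sub in combinations(sorted(items), j)
    }


# Candidate TIs from the example: BC^(1,3), CE^(2,3), BF^(3,3).
tis = [
    (frozenset("BC"), (1, 3)),
    (frozenset("CE"), (2, 3)),
    (frozenset("BF"), (3, 3)),
]

candidates = set(tis)
for ti in tis:
    candidates |= temporal_subitemsets(ti)
sis = candidates - set(tis)
print(len(candidates), sorted((sorted(i), p) for i, p in sis))
```

The six derived SIs are B^(1,3), C^(1,3), C^(2,3), E^(2,3), B^(3,3), and F^(3,3), which together with the three TIs give the 9 candidate itemsets of the example.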

[0077] Let n be the number of partitions with a time granularity, e.g., business-week, month, quarter, or year, in database D. In the model considered, db^(t,n) denotes the part of the transaction database formed by a continuous region from partition P_(t) to partition P_(n), and $|db^{t,n}| = \sum_{h=t}^{n} |P_h|$,

[0078] where db^(t,n)⊆D. An item x^(x.start,n) is termed a temporal item of x, meaning that P_(x.start) is the starting partition of x and n is the partition number of the last database partition retrieved. Again consider the database in FIG. 5. Since database D records the transaction data from January 2001 to March 2001, database D is intrinsically segmented into three partitions P₁, P₂, and P₃ in accordance with the “month” granularity. As a consequence, a partial database db^(2,3)⊂D consists of partitions P₂ and P₃. A temporal item E^(2,3) denotes that the exhibition period of E^(2,3) is from the beginning time of partition P₂ to the end time of partition P₃. An itemset X^(t,n) is called a maximal temporal itemset in a partial database db^(t,n) if t is the latest starting partition number of all items belonging to X in database D and n is the partition number of the last partition in db^(t,n) retrieved. In addition, let N_(db^(t,n))(X^(t,n)) be the number of transactions in partial database db^(t,n) that contain itemset X^(t,n), and let |db^(t,n)| be the number of transactions in the partial database db^(t,n). FIG. 7 shows a flowchart demonstrating the first algorithm, which is further outlined below, where the first algorithm is decomposed into five sub-procedures for ease of description.
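Before turning to the outline, the definition of a maximal temporal itemset given above can be illustrated with a small Python sketch. The starting partitions in `item_start` are assumptions inferred from the temporal items B^(1,3), C^(1,3), E^(2,3), and F^(3,3) mentioned in the example, not values read from FIG. 5, and the function name is hypothetical.

```python
def maximal_temporal_itemset(items, item_start, n):
    """Return (X, (t, n)) where t is the latest starting partition among
    the items of X, i.e. the maximal common exhibition period of X."""
    t = max(item_start[x] for x in items)
    return frozenset(items), (t, n)


# Assumed starting partitions of the individual items.
item_start = {"B": 1, "C": 1, "E": 2, "F": 3}
n = 3  # number of the last partition retrieved

bc = maximal_temporal_itemset("BC", item_start, n)
ce = maximal_temporal_itemset("CE", item_start, n)
bf = maximal_temporal_itemset("BF", item_start, n)
print(bc, ce, bf)
```

Under these assumptions the itemsets BC, CE, and BF obtain the exhibition periods (1,3), (2,3), and (3,3) respectively, in line with the candidate TIs of the example.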

[0079] Initial Sub-procedure: The database D is partitioned into n partitions and CF is set to ∅

[0080] db^(1,n)=The partial database of D formed by a continuous region from P₁ to P_(n)

[0081] |db^(1,n)|=Number of transactions in db^(1,n)

[0082] X^(t,n)=A temporal itemset in partial database db^(t,n)

[0083] MCP(X^(t,n))=(t,n), the maximal common exhibition period of an itemset X

[0084] (X⇒Y)^(t,n)=A general temporal association rule in db^(t,n)

[0085] supp((X⇒Y)^(t,n))=The support of X⇒Y in partial database db^(t,n)

[0086] conf((X⇒Y)^(t,n))=The confidence of X⇒Y in partial database db^(t,n)

[0087] s=Minimum support threshold required

[0088] min_leng=Minimum length of exhibition period required

[0089] TI=A maximal temporal itemset

[0090] SI=A corresponding temporal sub-itemset of TI

[0091] n=Number of partitions;

[0092] CF=cumulative filter

[0093] P=partition

[0094] C=set of progressive candidate itemsets generated by database db^(i,j)

[0095] L=determined frequent itemset

1. $|db^{1,n}| = \sum_{k=1}^{n} |P_k|$;

[0096] 2. CF=∅;

[0097] 3. begin for k=1 to n //1^(st) scan of db^(1,n)

[0098] 4. begin for each 2-itemset X₂^(t,n)∈P_(k)

[0099] where n−t>min_leng

[0100] 5. if (X₂∉CF)

[0101] 6. X₂.count=N_(pk)(X₂);

[0102] 7. X₂.start=k;

[0103] 8. if (X₂.count≧s*|P_(k)|)

[0104] 9. CF=CF∪X₂;

[0105] 10. if (X₂∈CF)

[0106] 11. X₂.count=X₂.count+N_(pk)(X₂);

12. if $\left(X_2.count < \left\lceil s * \sum_{m=X_2.start}^{k} |P_m| \right\rceil\right)$

[0107] 13. CF=CF−X₂;

[0108] 14. end

[0109] 15. end

[0110] 16. select C₂ from X₂ where X₂∈CF;

[0111] 17. CF=∅

[0112] Sub-procedure II: Generate candidate TIs and SIs with the scheme of database scan reduction

[0113] 18. begin while (C_(k)≠∅ & k≧2)

[0114] 19. C_(k+1)=C_(k)*C_(k);

[0115] 20. k=k+1;

[0116] 21. end

22. $X_k^{t,n} = \{X_k^{t,n} \subseteq X_k \mid X_k \in C_k\}$;

[0117] //Candidate TIs generation

23. $SI(X_k^{t,n}) = \{X_j^{t,n} \subseteq X_k^{t,n} \mid j < k\}$;

[0118] //Candidate SIs of TIs generation

24. $CF = CF \cup SI(X_k^{t,n})$;

25. select $X_k^{t,n}$ into $C_k$ where $X_k^{t,n} \in CF$;

[0119] Sub-procedure III: Generate all frequent TIs and SIs with the 2^(nd) scan of database D

[0120] 26. begin for h=1 to n

27. for each itemset $X_k^{t,n} \in C_k$

28. $X_k^{t,n}.count = X_k^{t,n}.count + N_{p_h}(X_k^{t,n})$;

[0121] 29. end

[0122] 30. for each itemset $X_k^{t,n} \in C_k$

31. if $\left(X_k^{t,n}.count \geq \left\lceil min\_supp * |db^{t,n}| \right\rceil\right)$

32. $L_k = L_k \cup X_k^{t,n}$;

[0123] 33. end

[0124] Sub-procedure IV: Prune out the redundant frequent SIs from L_(k)

[0125] 34. for each SI itemset X_(k)^(t,n)∈L_(k)

[0126] 35. if (there does not exist a TI X_(j)^(t,n)∈L_(j) with j>k)

[0127] 36. L_(k)=L_(k)−X_(k)^(t,n);

[0128] 37. end

[0129] 38. return L_(k);

[0130] In essence, the Initial Sub-procedure scans partition P_(i), for i=1 to n, to find the set of all local frequent 2-itemsets in P_(i). Note that CF is a superset of the set of all frequent 2-itemsets in D. The first algorithm constructs CF incrementally by adding candidate 2-itemsets to CF and starts counting the number of occurrences of each candidate 2-itemset X₂ in CF whenever X₂ is added to CF. If the cumulative occurrences of a candidate 2-itemset X₂ do not meet the partial minimum support required, X₂ is removed from the progressive screen CF. From step 3 to step 15 of the Initial Sub-procedure, the first algorithm processes one partition at a time for all partitions. When processing partition P_(i), each potential candidate 2-itemset X₂ is read and saved to CF, where its exhibition period, i.e., n−t, should be larger than the minimum constraint exhibition period min_leng required. The number of occurrences of an itemset X₂ and its starting partition, which records its first occurrence in CF, are stored in X₂.count and X₂.start respectively. As such, at the end of processing db^(1,h), an itemset is kept in CF only if $X_2.count \geq \left\lceil s * \sum_{m=X_2.start}^{h} |P_m| \right\rceil$.

[0131] Note that a large number of infrequent TI candidates are thereby removed by the early pruning of this progressive partitioning process. Next, in Step 16 we select C₂ from X₂∈CF and set CF=∅ in Step 17.

[0132] In sub-procedure II, with the scan reduction scheme [26], the C₂ produced by the first scan of the database is employed to generate the C_(k)s (k≧3) in main memory from step 18 to step 21. Recall that X_(k)^(t,n) is a maximal temporal k-itemset in a partial database db^(t,n).

[0133] In Step 22, all candidate TIs, i.e., the X_(k)^(t,n)s, are generated from X_(k)∈C_(k) by considering the maximal common exhibition period of itemset X_(k), where MCP(X_(k))=(t,n).

[0134] After that, from step 23 to step 25, all corresponding temporal sub-itemsets of X_(k)^(t,n), i.e., SI(X_(k)^(t,n)), are generated.

[0135] These sub-itemsets are then joined into CF.

[0136] Then, from Step 26 to Step 33 of Sub-procedure III, we begin the second database scan to calculate the support of each itemset in CF and to find out which candidate itemsets are really frequent TIs and SIs in database D.

[0137] As a result, those itemsets whose X_(k)^(t,n).count≧┌s*|db^(t,n)|┐ are the frequent temporal itemsets L_(k)s.

[0138] Finally, in sub-procedure IV, we have to prune out from the L_(k)s those redundant frequent SIs whose TI itemsets are not frequent in database D. The output of the first algorithm consists of the frequent itemsets L_(k)s of database D. From these output L_(k)s in Step 38, all kinds of general temporal association rules implied in database D can be generated in a straightforward manner.

[0139] Note that the first algorithm is able to filter out false candidate itemsets in P_(i) with a hash table. As in [26], by using a hash table to prune candidate 2-itemsets, i.e., C₂, in each accumulative ongoing partition set P_(i) of the transaction database, the CPU and memory overhead of the algorithm can be further reduced. The first algorithm provides a very efficient solution for mining general temporal association rules. This feature is, as described earlier, very important for mining publication-like databases whose data are exhibited from different starting times. In addition, the progressive screen produced in each processing phase constitutes the key component in realizing the mining of general temporal association rules. The first algorithm has several important advantages: by judiciously employing the progressive knowledge from the previous phases, the algorithm is able to reduce the number of candidate itemsets efficiently, which in turn reduces the CPU and memory overhead; and owing to the small number of candidate sets generated, the scan reduction technique can be applied efficiently. As a result, only two scans of the time series database are required.

[0140] A second algorithm, for incremental mining of association rules, is also formed on the basis of the pre-processing algorithm. The second algorithm effectively controls memory utilization by the technique of sliding-window partition. More importantly, the second algorithm is particularly powerful for efficient incremental mining of an ongoing time-variant transaction database. Incremental mining is increasingly used for record-based databases whose data are continuously being added. Examples of such applications include Web log records, stock market data, grocery sales data, transactions in electronic commerce, and daily weather/traffic records. Incremental mining can be decomposed into two procedures: a Preprocessing procedure for mining the original transaction database, and an Incremental procedure for updating the frequent itemsets of an ongoing time-variant transaction database. The preprocessing procedure is utilized only for the initial mining of association rules in the original database, e.g., db^(1,n). For the mining of association rules in db^(2,n+1), db^(3,n+2), db^(i,j), and so on, the incremental procedure is employed. Consider the database in FIG. 8. Assume that the original transaction database db^(1,3) is segmented into three partitions, i.e., {P₁, P₂, P₃}, in the preprocessing procedure. Each partition is scanned sequentially for the generation of candidate 2-itemsets in the first scan of the database db^(1,3). After scanning the first segment of 3 transactions, i.e., partition P₁, the 2-itemsets {AB, AC, AE, AF, BC, BE, CE} are generated, as shown in FIG. 9a. In addition, each potential candidate itemset c∈C₂ has two attributes: c.start, which contains the identity of the starting partition when c was added to C₂, and c.count, which contains the number of occurrences of c since c was added to C₂. Since there are three transactions in P₁, the partial minimal support is ⌈3*0.4⌉=2. Such a partial minimal support is called the filtering threshold herein. Itemsets whose occurrence counts are below the filtering threshold are removed. Then, as shown in FIG. 9a, only {AB, AC, BC}, marked by “O”, remain as candidate itemsets (of type β in this phase, since they are newly generated), whose information is then carried over to the next phase of processing.

[0141] Similarly, after scanning partition P₂, the occurrence counts of potential candidate 2-itemsets are recorded (of type α and type β). From FIG. 9a, it is noted that since there are also 3 transactions in P₂, the filtering threshold of those itemsets carried over from the previous phase (which become type α candidate itemsets in this phase) is ⌈(3+3)*0.4⌉=3, and that of newly identified candidate itemsets (i.e., type β candidate itemsets) is ⌈3*0.4⌉=2. It can be seen from FIG. 9a that we have 5 candidate itemsets in C₂ after the processing of partition P₂; 3 of them are of type α and 2 of them are of type β.
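The two filtering thresholds quoted above can be checked with a few lines of Python. This is only an illustrative sketch: the function name `filtering_threshold` and the choice to pass the minimum support as a string (so that it is parsed exactly rather than as a binary float) are assumptions made for this example and are not part of the described method.

```python
from fractions import Fraction
from math import ceil


def filtering_threshold(min_support, partition_sizes):
    """Return ceil(s * total transactions) over the partitions an itemset was counted in.

    min_support is given as a string such as "0.4" so that it is parsed exactly;
    a binary float could push an exact product slightly above an integer.
    """
    s = Fraction(min_support)
    return ceil(s * sum(partition_sizes))


# Type beta itemset first seen in a partition of 3 transactions.
new_in_partition = filtering_threshold("0.4", [3])
# Type alpha itemset carried over from P1 and re-checked after P2 (3 + 3 transactions).
carried_over = filtering_threshold("0.4", [3, 3])
print(new_in_partition, carried_over)
```

With a minimum support of 0.4 this yields 2 for a newly identified itemset and 3 for an itemset carried over from the previous partition, matching the figures in the text.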

[0142] Finally, partition P₃ is processed by the second algorithm. The resulting candidate 2-itemsets are C₂={AB, AC, BC, BD, BE}, as shown in FIG. 9a. Note that, though appearing in the previous phase P₂, itemset {AD} is removed from C₂ once P₃ is taken into account, since its occurrence count does not meet the filtering threshold then, i.e., 2<3. However, we do have one new itemset, i.e., BE, which joins C₂ as a type β candidate itemset. Consequently, we have 5 candidate 2-itemsets generated by the second algorithm; 4 of them are of type α and one of them is of type β.

[0143] After generating C₂ from the first scan of database db^(1,3), we employ the scan reduction technique and use C₂ to generate C_(k) (k=2, 3, . . . , n). C₂ is used to generate the candidate 3-itemsets, and each subsequent C_(k−1)′

[0144] can be utilized to generate C_(k)′. Clearly, a C₃′ generated from C₂*C₂, instead of from L₂*L₂, will have a size greater than |C₃|, where C₃ is generated from L₂*L₂. However, since the |C₂| generated by the second algorithm is very close to the theoretical minimum, i.e., |L₂|, |C₃′| is not much larger than |C₃|. Similarly, |C_(k)′| is close to |C_(k)|. All C_(k)′ can be stored in main memory, and we can find all L_(k) (k=1, 2, . . . , n) together when the second scan of the database db^(1,3) is performed. Thus, only two scans of the original database db^(1,3) are required in the preprocessing step. In addition, instead of recording all L_(k)'s in main memory, we only have to keep C₂ in main memory for the subsequent incremental mining of an ongoing time-variant transaction database.
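The scan reduction step, in which every C_(k+1)′ is derived from C_(k)′ in main memory without touching the database, might be sketched as follows. The text writes the join only as C_(k)*C_(k); the sketch assumes the usual Apriori-style join (two k-itemsets sharing k−1 items, all k-subsets of the result required to be candidates). The function names `join_candidates` and `all_candidates` are hypothetical.

```python
from itertools import combinations


def join_candidates(c_k):
    """Assumed form of C_k * C_k: merge k-itemsets that share k-1 items and
    keep a (k+1)-itemset only if every one of its k-subsets is in C_k."""
    c_k = {frozenset(x) for x in c_k}
    if not c_k:
        return set()
    k = len(next(iter(c_k)))
    joined = set()
    for a, b in combinations(c_k, 2):
        union = a | b
        if len(union) == k + 1 and all(
            frozenset(sub) in c_k for sub in combinations(union, k)
        ):
            joined.add(union)
    return joined


def all_candidates(c2):
    """Generate C_3', C_4', ... from C_2 alone, stopping at the first empty level."""
    levels = {2: {frozenset(x) for x in c2}}
    h = 2
    while levels[h]:
        levels[h + 1] = join_candidates(levels[h])
        h += 1
    del levels[h]  # drop the empty level that ended the loop
    return levels


# C2 as reported in the text after processing P3.
levels = all_candidates(["AB", "AC", "BC", "BD", "BE"])
print({k: sorted("".join(sorted(x)) for x in v) for k, v in levels.items()})
```

Under these assumptions, the C₂ reported after P₃ produces a single candidate 3-itemset, ABC, so the second scan only has to count C₂ and that one extra candidate.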

[0145] The merit of the second algorithm mainly lies in its incremental procedure. As depicted in FIG. 9b, the mining database is moved from db^(1,3) to db^(2,4). Thus, some transactions, i.e., t₁, t₂, and t₃, are deleted from the mining database, and other transactions, i.e., t₁₀, t₁₁, and t₁₂, are added. For ease of exposition, this incremental step can be divided into three sub-steps: (1) generating C₂ in D⁻=db^(1,3)−Δ⁻, (2) generating C₂ in db^(2,4)=D⁻+Δ⁺, and (3) scanning the database db^(2,4) only once for the generation of all frequent itemsets L_(k). In the first sub-step, db^(1,3)−Δ⁻=D⁻, we check the pruned partition P₁, reduce the value of c.count, and set c.start=2 for those candidate itemsets c where c.start=1. It can be seen that itemsets {AB, AC, BC} are removed. Next, in the second sub-step, we scan the incremental transactions in P₄, whose newly identified itemsets are treated as type β candidate itemsets. Finally, in the third sub-step, we use C₂ to generate C_(k)′ as mentioned above. By scanning db^(2,4) only once, the second algorithm obtains the frequent itemsets {A, B, C, D, E, F, BD, BE, DE} in db^(2,4). The improvement achieved by the second algorithm is even more prominent as the amount of the incremental portion increases and as the size of the database db^(i,j) increases.

[0146] The second algorithm is illustrated in the flowchart of FIG. 10and shown below wherein:

[0147] db^(1,n)=The partial database of D formed by a continuous region from P₁ to P_(n)

[0148] s=Minimum support required

[0149] |P_(k)|=Number of transactions in partition P_(k)

[0150] N_(pk)(I)=Number of transactions in partition P_(k) that contain itemset I

[0151] |db^(1,n)(I)|=Number of transactions in db^(1,n) that contain itemset I

[0152] C^(i,j)=The set of progressive candidate itemsets generated by database db^(i,j)

[0153] Δ⁻=The deleted portion of an ongoing transaction database

[0154] D⁻=The unchanged portion of an ongoing transaction database

[0155] Δ⁺=The added portion of an ongoing transaction database

[0156] Preprocessing procedure of the second algorithm:

[0157] 1. n=Number of partitions;

2. $|db^{1,n}| = \sum_{k=1}^{n} |P_k|$;

[0158] 3. CF=∅;

[0159] 4. begin for k=1 to n //1^(st) scan of db^(1,n)

[0160] 5. begin for each 2-itemset I∈P_(k)

[0161] 6. if (I∉CF)

[0162] 7. I.count=N_(pk)(I);

[0163] 8. I.start=k;

[0164] 9. if (I.count≧s*|P_(k)|)

[0165] 10. CF=CF∪I;

[0166] 11. if (I∈CF)

[0167] 12. I.count=I.count+N_(pk)(I);

13. if (I.count < $\left\lceil s \cdot \sum_{m=I.start}^{k} |P_m| \right\rceil$)

[0168] 14. CF=CF−I;

[0169] 15. end

[0170] 16. end

[0171] 17. select C₂^(1, n)

[0172] from I where I∈CF

[0173] 18. keep C₂^(1, n)

[0174] in main memory;

[0175] 19. h=2; //C₁ is given

[0176] 20. begin while (C_(h)^(1,n) ≠ ∅)

[0177] 21. C_(h+1)^(1,n) = C_(h)^(1,n) * C_(h)^(1,n); //Database scan reduction

[0178] 22. h=h+1;

[0179] 23. end

[0180] 24. refresh I.count=0 where I ∈ C_(h)^(1, n);

[0181] 25. begin for k=1 to n //2^(nd) scan of db^(1,n)

[0182] 26. for each itemset I ∈ C_(h)^(1, n)

[0183] 27. I.count=I.count+N_(pk)(I);

[0184] 28. end

[0185] 29. for each itemset I ∈ C_(h)^(1, n)

[0186] 30. if (I.count≧┌s*|db^(1,n)|┐)

[0187] 31. L_(h)=L_(h)∪I;

[0188] 32. end

[0189] 33. return L_(h);

[0190] Incremental procedure of the second algorithm:

[0191] 1. Original database=db^(m,n);

[0192] 2. New database=db^(i,j);

[0193] 3. Database removed $\Delta^{-} = \sum_{k=m}^{i-1} P_k$;

[0194] 4. Database added $\Delta^{+} = \sum_{k=n+1}^{j} P_k$;

5. $D^{-} = \sum_{k=i}^{n} P_k$;

[0195] 6. db^(i,j)=db^(m,n)−Δ⁻+Δ⁺;

[0196] 7. loading C₂^(m, n)

[0197] of db^(m,n) into CF where I ∈ C₂^(m, n)

[0198] 8. begin for k=m to i−1//one scan of Δ⁻

[0199] 9. begin for each 2-itemset I∈P_(k)

[0200] 10. if (I∈CF and I.start≦k)

[0201] 11. I.count=I.count−N_(pk)(I);

[0202] 12. I.start=k+1;

13. if (I.count < $\left\lceil s \cdot \sum_{m=I.start}^{n} |P_m| \right\rceil$)

[0203] 14. CF=CF−I;

[0204] 15. end

[0205] 16. end

[0206] 17. begin for k=n+1 to j //one scan of Δ⁺

[0207] 18. begin for each 2-itemset I∈P_(k)

[0208] 19. if (I∉CF)

[0209] 20. I.count=N_(pk)(I);

[0210] 21. I.start=k;

[0211] 22. if (I.count≧s*|P_(k)|)

[0212] 23. CF=CF∪I;

[0213] 24. if (I∈CF)

[0214] 25. I.count=I.count+N_(pk)(I);

26. if (I.count < $\left\lceil s \cdot \sum_{m=I.start}^{k} |P_m| \right\rceil$)

[0215] 27. CF=CF−I;

[0216] 28. end

[0217] 29. end

[0218] 30. select C₂^(i, j)

[0219] from I where I∈CF;

[0220] 31. keep C₂^(i, j)

[0221] in main memory;

[0222] 32. h=2; //C₁ is well known

[0223] 33. begin while (C_(h)^(i,j) ≠ ∅)

[0224] 34. C_(h+1)^(i,j) = C_(h)^(i,j) * C_(h)^(i,j); //Database scan reduction

[0225] 35. h=h+1;

[0226] 36. end

[0227] 37. refresh I.count=0 where I ∈ C_(h)^(i,j);

[0228] 38. begin for k=i to j //only one scan of db^(i,j)

[0229] 39. for each itemset I ∈ C_(h)^(i,j)

[0230] 40. I.count=I.count+N_(pk)(I);

[0231] 41. end

[0232] 42. for each itemset I ∈ C_(h)^(i,j)

[0233] 43. if (I.count≧⌈s*|db^(i,j)|⌉)

[0234] 44. L_(h)=L_(h)∪I;

[0235] 45. end

[0236] 46. return L_(h);

[0237] The preprocessing procedure of the second algorithm is explained as follows. Initially, the database db^(1,n) is partitioned into n partitions by executing the preprocessing procedure (in Step 2), and CF, i.e., the cumulative filter, is empty (in Step 3). Let C₂^(i,j)

[0238] be the set of progressive candidate 2-itemsets generated by database db^(i,j). It is noted that instead of keeping the L_(k)'s in main memory, the second algorithm only records C₂^(1,n),

[0239] which is generated by the preprocessing procedure, to be used by the incremental procedure.

[0240] From Step 4 to Step 16, the algorithm processes one partition at a time for all partitions. When partition P_(k) is processed, each potential candidate 2-itemset is read and saved to CF. The number of occurrences of an itemset I and its starting partition are recorded in I.count and I.start, respectively. An itemset whose I.count≧ $\left\lceil s \cdot \sum_{m=I.start}^{k} |P_m| \right\rceil$

[0241] will be kept in CF. Next, we select C₂^(1,n)

[0242] from I where I∈CF and keep C₂^(1,n)

[0243] in main memory for the subsequent incremental procedure. By employing the scan reduction technique from Step 19 to Step 23, the C_(h)^(1,n)'s (h≧3)

[0244] are generated in main memory. After refreshing I.count=0 where I∈C_(h)^(1,n),

[0245] we begin the last scan of the database for the preprocessing procedure from Step 25 to Step 28. Finally, those itemsets whose I.count≧⌈s*|db^(1,n)|⌉ are the frequent itemsets.
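The first scan of the preprocessing procedure (Steps 4 to 16) could be sketched in Python as below. This is a minimal illustration under stated assumptions, not the claimed implementation: Steps 6 and 11 are read as an if/else pair so that a newly added itemset is not counted twice, every itemset already in CF is re-checked after each partition (as in the {AD} example of the narrative), and the transactions in the usage example are invented for illustration since FIG. 8 is not reproduced here. The names `pair_counts` and `first_scan` are hypothetical.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations
from math import ceil


def pair_counts(partition):
    """Count, for each 2-itemset, the transactions of one partition containing it."""
    counts = Counter()
    for transaction in partition:
        for pair in combinations(sorted(set(transaction)), 2):
            counts[frozenset(pair)] += 1
    return counts


def first_scan(partitions, min_support):
    """Build the cumulative filter CF as {2-itemset: (start, count)}.

    partitions: list of partitions, each a list of transactions (iterables of items).
    min_support: exact value such as "0.4" (string) or a Fraction.
    """
    s = Fraction(min_support)
    sizes = [len(p) for p in partitions]
    cf = {}
    for k, partition in enumerate(partitions, start=1):
        local = pair_counts(partition)
        known = set(cf)  # itemsets carried over from earlier partitions (type alpha)
        for itemset in known:
            start, count = cf[itemset]
            count += local.get(itemset, 0)
            # Cumulative threshold over partitions start..k.
            if count < ceil(s * sum(sizes[start - 1:k])):
                del cf[itemset]
            else:
                cf[itemset] = (start, count)
        for itemset, count in local.items():
            # Newly identified itemsets (type beta) face the local threshold only.
            if itemset not in known and count >= ceil(s * sizes[k - 1]):
                cf[itemset] = (k, count)
    return cf


# Invented transactions, three per partition, each written as a string of items.
demo = [["ABC", "AB", "BC"], ["ABD", "AD", "BE"], ["BE", "BE", "AB"]]
cf = first_scan(demo, "0.4")
print({"".join(sorted(i)): v for i, v in cf.items()})
```

In this invented data, AB survives all three partitions, BC and AD are dropped once their cumulative counts fall below the growing threshold, and BE enters in the last partition.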

[0246] In the incremental procedure of the second algorithm, D⁻ indicates the unchanged portion of an ongoing transaction database. The deleted and added portions of an ongoing transaction database are denoted by Δ⁻ and Δ⁺, respectively. It is worth mentioning that the sizes of Δ⁻ and Δ⁺, i.e., |Δ⁻| and |Δ⁺| respectively, are not required to be the same. The incremental procedure of the algorithm is devised to maintain frequent itemsets efficiently and effectively. The incremental step can be divided into three sub-steps: (1) generating C₂ in D⁻=db^(m,n)−Δ⁻, (2) generating C₂ in db^(i,j)=D⁻+Δ⁺, and (3) scanning the database db^(i,j) only once for the generation of all frequent itemsets L_(k). Initially, after some update activities, old transactions Δ⁻ are removed from the database db^(m,n) and new transactions Δ⁺ are added (in Step 6). Note that Δ⁻⊂db^(m,n). Denote the updated database as db^(i,j). Note that db^(i,j)=db^(m,n)−Δ⁻+Δ⁺. We denote the unchanged transactions by D⁻=db^(m,n)−Δ⁻=db^(i,j)−Δ⁺. After loading C₂^(m,n)

[0247] of db^(m,n) into CF where I∈C₂^(m,n),

[0248] we start the first sub-step, i.e., generating C₂ in D⁻=db^(m,n)−Δ⁻. This sub-step reverses the cumulative processing described in the preprocessing procedure. From Step 8 to Step 16, we prune the occurrences of an itemset I that appeared before partition P_(i) by reducing the value of I.count where I∈CF and I.start<i. Next, from Step 17 to Step 36, similarly to the cumulative processing of the preprocessing procedure, the second sub-step generates the new potential C₂^(i,j)

[0249] in db^(i,j)=D⁻+Δ⁺ and employs the scan reduction technique to generate the C_(h)^(i,j)'s

[0250] from C₂^(i,j).

[0251] Finally, to generate the new L_(k)'s in the updated database, we scan db^(i,j) only once in the incremental procedure to maintain the frequent itemsets. Note that C₂^(i,j)

[0252] is kept in main memory for the next generation of incremental mining.
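The first two sub-steps of the incremental procedure (Steps 8 to 29) might look as follows in Python. This is a sketch under assumptions: every itemset in CF whose start lies in a removed partition has its start advanced, whether or not it occurs in that partition, and the partitions and the old filter in the usage example are invented for illustration. The names `pair_counts` and `slide_window` are hypothetical.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations
from math import ceil


def pair_counts(partition):
    """Count, for each 2-itemset, the transactions of one partition containing it."""
    counts = Counter()
    for transaction in partition:
        for pair in combinations(sorted(set(transaction)), 2):
            counts[frozenset(pair)] += 1
    return counts


def slide_window(cf, partitions, removed, added, last_kept, min_support):
    """Update CF = {2-itemset: (start, count)} when the window moves.

    partitions: {partition index: list of transactions}
    removed / added: ascending partition indices of the deleted and added portions
    last_kept: n, the last partition of the old window db^(m,n)
    """
    s = Fraction(min_support)

    def window_size(first, last):
        return sum(len(partitions[m]) for m in range(first, last + 1))

    cf = dict(cf)
    for k in removed:  # one scan of the deleted portion
        local = pair_counts(partitions[k])
        for itemset, (start, count) in list(cf.items()):
            if start <= k:
                count -= local.get(itemset, 0)
                start = k + 1
                if count < ceil(s * window_size(start, last_kept)):
                    del cf[itemset]
                else:
                    cf[itemset] = (start, count)
    for k in added:  # one scan of the added portion
        local = pair_counts(partitions[k])
        known = set(cf)
        for itemset in known:
            start, count = cf[itemset]
            count += local.get(itemset, 0)
            if count < ceil(s * window_size(start, k)):
                del cf[itemset]
            else:
                cf[itemset] = (start, count)
        for itemset, count in local.items():
            if itemset not in known and count >= ceil(s * len(partitions[k])):
                cf[itemset] = (k, count)
    return cf


# Invented data: the old window is partitions 1..3, the new window is 2..4.
parts = {
    1: ["ABC", "AB", "BC"],
    2: ["ABD", "AD", "BE"],
    3: ["BE", "BE", "AB"],
    4: ["BE", "BDE", "BD"],
}
old_cf = {frozenset("AB"): (1, 4), frozenset("BE"): (3, 2)}
new_cf = slide_window(old_cf, parts, removed=[1], added=[4], last_kept=3, min_support="0.4")
print({"".join(sorted(i)): v for i, v in new_cf.items()})
```

Only the deleted and added partitions are read here; the unchanged portion D⁻ is not rescanned until the single final scan that counts the generated candidates.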

[0253] Note that the second algorithm is able to filter out false candidate itemsets in P_(i) with a hash table. As in [24], by using a hash table to prune candidate 2-itemsets, i.e., C₂, in each accumulative ongoing partition set P_(i) of the transaction database, the CPU and memory overhead of the algorithm can be further reduced. The second algorithm provides an efficient solution for incremental mining, which is important for the mining of record-based databases whose data are frequently and continuously added, such as web log records, stock market data, grocery sales data, and transactions in electronic commerce, to name a few.

[0254] The third algorithm, also based on the pre-processing algorithm, addresses weighted association rules in a time-variant database. In the third algorithm, the importance of each transaction period is first reflected by a proper weight assigned by the user. The algorithm then partitions the time-variant database in light of the weighted periods of transactions and performs weighted mining, progressively accumulating the occurrence count of each candidate 2-itemset based on the intrinsic partitioning characteristics. With this design, the algorithm is able to efficiently produce weighted association rules for applications where different time periods are assigned different weights. The algorithm is also designed to employ a filtering threshold in each partition to prune out cumulatively infrequent 2-itemsets early. The feature that the number of candidate 2-itemsets generated by function W(·) in the weighted period P_(i) of the database D. Formally, we have the following definitions:

[0255] In the first definition, let N_(Pi)(X) be the number of transactions in partition P_(i) that contain itemset X. Consequently, the weighted support value of an itemset X can be formulated as $S^{W}(X) = \sum N_{P_i}(X) \times W(P_i)$.

[0256] As a result, the weighted support ratio of an itemset X is $supp^{W}(X) = \frac{S^{W}(X)}{\sum |P_i| \times W(P_i)}$.

[0257] In accordance with the first definition, an itemset X is termed frequent when the weighted occurrence frequency of X is larger than the value of min_supp required, i.e., supp^(W)(X)>min_supp, in transaction set D. The weighted confidence of a weighted association rule (X⇒Y)^(W) is then defined below.

[0258] In the second definition, $conf^{W}(X \Rightarrow Y) = \frac{supp^{W}(X \cup Y)}{supp^{W}(X)}$.

[0259] In the third definition, an association rule X⇒Y is termed a frequent weighted association rule (X⇒Y)^(W) if and only if its weighted support is larger than the minimum support required, i.e., supp^(W)(X∪Y)>min_supp, and its weighted confidence conf^(W)(X⇒Y) is larger than the minimum confidence needed, i.e., conf^(W)(X⇒Y)>min_conf. Explicitly, the third algorithm explores the mining of weighted association rules, denoted by (X⇒Y)^(W), which are produced by the two newly defined concepts of weighted support and weighted confidence in light of the corresponding weights of individual transactions. Instead of using the traditional support threshold min_S^(T)=⌈|D|×min_supp⌉ as a minimum support threshold for each item, a weighted minimum support, denoted by $min\_S^{W} = \left\{\sum |P_i| \times W(P_i)\right\} \times min\_supp$,

[0260] is employed for the mining of weighted association rules, where |P_(i)|

[0261] and W(P_(i)) represent the number of partial transactions and their corresponding weight value, assigned by a weighting function W(·), in the weighted period P_(i) of the database D. Let N_(Pi)(X) be the number of transactions in partition P_(i) that contain itemset X. The weighted support value of an itemset X can then be formulated as $S^{W}(X) = \sum N_{P_i}(X) \times W(P_i)$.

[0262] As a result, the weighted support ratio of an itemset X is $supp^{W}(X) = \frac{S^{W}(X)}{\sum |P_i| \times W(P_i)}$.

[0263] Looking at FIG. 11, the minimum transaction support and confidence are assumed to be min_supp=30% and min_conf=75%, respectively. The time-variant database contains the transaction records from January 2001 to March 2001. The starting date of each transaction item is also given. Based on traditional mining techniques, the support threshold is min_S^(T)=⌈12×0.3⌉=4, where 12 is the size of transaction set D. It can be seen that only {B, C, D, E, BC} can be termed frequent itemsets, since their occurrence counts in this transaction database are all larger than the support threshold min_S^(T). Thus, rule C⇒B is termed a frequent association rule with support supp(C∪B)=41.67% and confidence conf(C⇒B)=83.33%. If we assign the weights W(P₁)=0.5, W(P₂)=1, and W(P₃)=2, the newly defined support threshold is min_S^(W)={4×0.5+4×1+4×2}×0.3=4.2, and we have the weighted association rules (C⇒B)^(W), with relative weighted support supp^(W)(C∪B)=35.7% and confidence $conf^{W}(C \Rightarrow B) = \frac{supp^{W}(C \cup B)}{supp^{W}(C)} = 83.3\%$, and (F⇒B)^(W),

[0264] with relative weighted support supp^(W)(F∪B)=42.8% and confidence $conf^{W}(F \Rightarrow B) = \frac{supp^{W}(F \cup B)}{supp^{W}(F)} = 100\%$.
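The weighted threshold and the weighted support ratio defined above can be reproduced with exact arithmetic. The sketch below uses the partition sizes and weights stated in the text; the per-partition occurrence counts are hypothetical, chosen only for illustration because FIG. 11 is not reproduced here, and the function names are assumptions.

```python
from fractions import Fraction


def weighted_total(sizes, weights):
    """Sum of |P_i| x W(P_i) over all partitions."""
    return sum(n * w for n, w in zip(sizes, weights))


def weighted_min_support(sizes, weights, min_supp):
    """min_S^W = {sum |P_i| x W(P_i)} x min_supp."""
    return weighted_total(sizes, weights) * min_supp


def weighted_support_ratio(counts, sizes, weights):
    """supp^W(X) = S^W(X) / sum |P_i| x W(P_i), with S^W(X) = sum N_Pi(X) x W(P_i)."""
    s_w = sum(c * w for c, w in zip(counts, weights))
    return s_w / weighted_total(sizes, weights)


sizes = [4, 4, 4]  # four transactions in each of P1, P2, P3
weights = [Fraction(1, 2), Fraction(1), Fraction(2)]  # W(P1), W(P2), W(P3)
threshold = weighted_min_support(sizes, weights, Fraction(3, 10))
print(float(threshold))  # 4.2

# Hypothetical per-partition counts of some itemset X (not taken from FIG. 11).
ratio = weighted_support_ratio([2, 0, 2], sizes, weights)
print(ratio, ratio > Fraction(3, 10))
```

Because the weights grow toward the most recent period, the same raw count contributes more to S^(W)(X) when it falls in P₃ than when it falls in P₁.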

[0265] Initially, a time-variant database D is partitioned into n partitions based on the weighted periods of transactions. The algorithm is illustrated in the flowchart in FIG. 13 and is further outlined below, where the algorithm is decomposed into four sub-procedures for ease of description. C₂ is the set of progressive candidate 2-itemsets generated by database D. Recall that N_(Pi)(X) is the number of transactions in partition P_(i) that contain itemset X and W(P_(i)) is the corresponding weight of partition P_(i).

[0266] Procedure 1: Initial Partition

[0267] 1. |D|=Σ_(l=1,n)|P_(l)|;

[0268] Procedure 2: Candidate 2-Itemset Generation

[0269] 2. begin for i=1 to n //1^(st) scan of D

[0270] 3. begin for each 2-itemset X₂∈P_(i)

[0271] 4. if (X₂∉C₂)

[0272] 5. X₂.count=N_(Pi)(X₂)×W(P_(i));

[0273] 6. X₂.start=i;

[0274] 7. if (X₂.count≧min_supp×|P_(i)|×W(P_(i)))

[0275] 8. C₂=C₂∪X₂;

[0276] 9. if (X₂∈C₂)

[0277] 10. X₂.count=X₂.count+N_(Pi)(X₂)×W(P_(i));

[0278] 11. if (X₂.count<min_supp×Σ_(m=X₂.start,i)(|P_(m)|×W(P_(m))))

[0279] 12. C₂=C₂−X₂;

[0280] 13. end

[0281] 14. end

[0282] Procedure 3: Candidate k-itemset Generation

[0283] 15. begin while (C_(k)≠∅ & k≧2)

[0284] 16. C_(k+1)=C_(k)*C_(k);

[0285] 17. k=k+1;

[0286] 18. end

[0287] Procedure 4: Frequent Itemset Generation

[0288] 19. begin for i=1 to n

[0289] 20. begin for each itemset X_(k)∈C_(k)

[0290] 21. X_(k).count=X_(k).count+N_(Pi)(X_(k))×W(P_(i));

[0291] 22. end

[0292] 23. begin for each itemset X_(k)∈C_(k)

[0293] 24. if (X_(k).count≧min_supp×Σ_(m=1,n)(|P_(m)|×W(P_(m))))

[0294] 25. L_(k)=L_(k)∪X_(k);

[0295] 26. end

[0296] 27. return L_(k);

[0297] Since there are four transactions in P₁, the partial weighted minimal support is min_S^(W)(P₁)=4×0.3×0.5=0.6. Such a partial weighted minimal support is called the filtering threshold. Itemsets whose occurrence counts are below the filtering threshold are removed. Then, as shown in FIG. 12a, only {BD, BC}, marked by “O”, remain as candidate itemsets (of type β in this phase, since they are newly generated), whose information is then carried over to the next phase P₂ of processing.

[0298] Similarly, after scanning partition P₂, the occurrence counts of potential candidate 2-itemsets are recorded (of type α and type β). From FIG. 12a, it is noted that since there are also 4 transactions in P₂, the filtering threshold of those itemsets carried over from the previous phase (which become type α candidate itemsets in this phase) is min_S^(W)(P₁+P₂)=4×0.3×0.5+4×0.3×1=1.8, and that of newly identified candidate itemsets (i.e., type β candidate itemsets) is min_S^(W)(P₂)=4×0.3×1=1.2. It can be seen in FIG. 12b that we have 3 candidate itemsets in C₂ after the processing of partition P₂; one of them is of type α and two of them are of type β.

[0299] Finally, partition P₃ is processed by the third algorithm. The resulting candidate 2-itemsets are C₂={BC, CE, BF}, as shown in FIG. 12b. Note that, though appearing in the previous phase P₂, itemset {DE} is removed from C₂ once P₃ is taken into account, since its occurrence count does not meet the filtering threshold then, i.e., 2<3.6. However, we do have one new itemset, i.e., {BF}, which joins C₂ as a type β candidate itemset. Consequently, only 3 candidate 2-itemsets are generated by the third algorithm; two of them are of type α and one of them is of type β.
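The weighted filtering thresholds quoted in this walk-through follow from the condition in Step 11 of Procedure 2. A small check with exact fractions is given below; the function name `weighted_filter_threshold` is an assumption made for this sketch.

```python
from fractions import Fraction


def weighted_filter_threshold(sizes, weights, min_supp, start, end):
    """min_supp x sum over m = start..end of |P_m| x W(P_m), with 1-based partitions."""
    return min_supp * sum(sizes[m - 1] * weights[m - 1] for m in range(start, end + 1))


sizes = [4, 4, 4]
weights = [Fraction(1, 2), Fraction(1), Fraction(2)]
min_supp = Fraction(3, 10)

for start, end in [(1, 1), (1, 2), (2, 2), (2, 3)]:
    value = weighted_filter_threshold(sizes, weights, min_supp, start, end)
    print(f"P{start}..P{end}: {float(value)}")

# {DE} started in P2 and has a weighted count of 2 after P3, so it is dropped.
print(Fraction(2) < weighted_filter_threshold(sizes, weights, min_supp, 2, 3))
```

The four values correspond to 0.6, 1.8, 1.2, and 3.6 in the text, the last being the threshold that removes {DE}.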

[0300] After generating C₂ from the first scan of database D, we employ the scan reduction technique.

[0301] In essence, the region ratio of an itemset is the support of that itemset if only the part db^(i,j) of the transaction database is considered.

[0302] Lemma 1: A 2-itemset X₂ remains in C₂ after the processing of partition P_(j) if and only if there exists an i such that for any integer t in the interval [i,j], r_(i,t)(X₂)≧min_S^(W)(db^(i,t)), where min_S^(W)(db^(i,t)) is the minimal weighted support required.

[0303] Lemma 1 leads to Lemma 2 below.

[0304] Lemma 2: An itemset X₂ remains in C₂ after the processing of partition P_(j) if and only if there exists an i such that r_(i,j)(X₂)≧min_S^(W)(db^(i,j)), where min_S^(W)(db^(i,j)) is the minimal weighted support required.

[0305] Lemma 2 leads to the following theorem, which states the correctness of algorithm PWM.

[0306] Theorem 1: If an itemset X is a frequent itemset, then X will be in the candidate set of itemsets produced by algorithm PWM.

[0307] It follows from Theorem 1 that when W(·)=1, the frequent itemsets generated by the third algorithm will be the same as those produced by the conventional (unweighted) association rule mining algorithms.

[0308] Various additional modifications may be made to the illustrated embodiments without departing from the spirit and scope of the invention. Therefore, the invention lies in the claims hereinafter appended.

What is claimed is:
 1. A pre-processing method for data mining, comprising: dividing a database into a plurality of partitions; scanning a first partition for generating a plurality of candidate itemsets; developing a filtering threshold based on each partition and removing the undesired candidate itemsets; and scanning a second partition while taking into consideration the desired candidate itemsets from the first partition.
 2. The method of claim 1, wherein the generation of candidate itemsets includes the steps of: assigning a candidate itemset a value of when an itemset was added to an accumulator; and adding a value for the number of occurrences of the itemset from the point the itemset to the accumulator.
 3. The method of claim 1, wherein the step of removing the undesired candidate itemsets is based on a minimum threshold requirement as defined by the filtering threshold.
 4. A method for mining general temporal association rules, comprising: dividing a database into a plurality of partitions including a first partition and a second partition; scanning the first partition for generating candidate itemsets; developing a filtering threshold based on the scanned first partition and removing the undesired candidate itemsets; scanning the second partition while taking into consideration the desired candidate itemsets from the first partition; performing a scan reduction process by considering an exhibition period of each candidate itemset; scanning the database to determine the support of each of the candidate itemsets in the filtering threshold; and pruning out redundant candidate itemsets that are not frequent in the database and outputting the final itemsets.
 5. The method of claim 4, wherein the generation of candidate itemsets includes the step of assigning a candidate itemset a value of when an itemset was added to an accumulator and adding a value for the number of occurrences of the itemset from the point the itemset to the accumulator.
 6. The method of claim 4, wherein the removal of undesired candidate itemsets is based on a minimum threshold requirement as defined by the filtering threshold.
 7. A method for incremental mining comprising: dividing a database into a plurality of partitions, including a first partition and a second partition; scanning the first partition for generating a plurality of candidate itemsets; developing a filtering threshold based on each of the partitions and removing undesired candidate itemsets of the candidate itemsets; removing transactions from the candidate itemset based on a previous partition; and adding transactions to the itemset based on a next partition.
 8. The method of claim 6, wherein the generation of the candidate itemsets includes the step of assigning a candidate itemset a value of when an itemset was added to an accumulator, and adding a value for the number of occurrences of the itemset from the point the itemset to the accumulator.
 9. The method of claim 6, wherein the removal of the undesired candidate itemsets is based on a minimum threshold requirement as defined by the filtering threshold.